Human emotion recognition has emerged as a vital research area in recent years due to its
widespread applications in psychology, healthcare, education, entertainment, and human-robot
interaction. This research article comprehensively analyzes a machine learning-based six-emotion
classification algorithm, focusing on its development, evaluation, and potential applications.
The study aims to assess the algorithm’s performance, identify its limitations, and discuss the
importance of selecting appropriate image descriptors for accurate emotion classification. The
algorithm achieved an overall accuracy of 92.23%, showcasing its potential in various fields.
However, the classification of specific emotions, particularly “excited” and “afraid”,
demonstrated some limitations, suggesting that further refinement is needed. The study also
highlights the significance of choosing suitable image descriptors, with the manual distance
calculation used in the framework proving effective. This article offers insights into developing
and evaluating a six-emotion classification algorithm using a machine learning framework,
emphasizing its strengths, limitations, and possible applications in multiple domains. The
findings contribute to ongoing efforts in creating robust, accurate, and versatile emotion
recognition systems that cater to the diverse needs of applications across psychology,
healthcare, robotics, education, and entertainment.
1. INTRODUCTION

Human emotion recognition is a rapidly growing field that has gained significant attention in
recent years due to its potential applications in various areas, including psychology,
healthcare, education, and entertainment. Emotions are complex and subjective experiences crucial
to communication, decision-making, and well-being. Understanding emotions is essential for
effective human-robot interaction, personalized mental health interventions, and many other
applications.

Recent technological advances have made it possible to detect and analyze human emotions using
various techniques. Researchers have explored different approaches to recognizing and classifying
emotions accurately, including machine learning algorithms, such as support vector machines,
artificial neural networks, and deep learning models. Other techniques involve analyzing
physiological signals, such as electroencephalography (EEG), electrocardiography (ECG), and
galvanic skin response (GSR), to extract features related to emotional responses.

Journal homepage: http://ijra.iaescore.com (ISSN: 2722-2586)

Emotion recognition systems have numerous potential applications in various fields. For example,
emotion recognition can help monitor and manage mental health conditions such as depression,
anxiety, and post-traumatic stress disorder (PTSD). In entertainment, emotion recognition can
personalize video and audio content to the viewer’s emotional state. In security, emotion
recognition can help detect and prevent crimes by identifying suspicious behavior and emotional
states.

Emotion recognition, which has applications in areas such as psychology, healthcare, robotics,
education, and entertainment, refers to identifying and analyzing human emotions from multiple
sources, such as facial expressions, physiological signals, speech, and gestures. The following
literature review discusses recent advances in emotion recognition from these modalities and the
methods proposed by various researchers.


2. LITERATURE REVIEW 

This section covers various state-of-the-art approaches to emotion recognition. We discuss four areas 
of emotion recognition: facial, speech, hybrid, and physiological signal-based methods. Numerous studies and 
techniques emphasize deep learning approaches and multimodal systems for improved accuracy. The research 
spans multiple applications, for example, human-computer interaction, real-world deficits, and physiological 
signals. 


2.1. Facial emotion recognition 

Jain et al. classified each image into one of six facial emotion classes. Balasubramanian et al.
covered the datasets and algorithms used for facial emotion recognition (FER). The feature
extraction algorithms used were Gabor filters [3], the histogram of oriented gradients (HoG) [4],
and the local binary pattern (LBP) [6]. Hassouneh et al. aimed to classify the emotional
expressions of physically disabled people (deaf, mute, and bedridden) and autistic children based
on facial landmarks and electroencephalograph (EEG) signals, using convolutional neural network
(CNN) and long short-term memory (LSTM) classifiers. Their algorithm recognizes emotion in real
time using virtual markers tracked by an optical flow algorithm that works effectively under
uneven lighting, subject head rotation (up to 25°), multiple backgrounds, and various skin tones.
Mellouk et al. studied recent works on automatic FER via deep learning. Deep learning techniques
for human-computer interaction were employed in [9], building on the advancement of artificial
intelligence. Hayes et al. studied how the variability in age effects across facial emotion
recognition task designs affects the understanding of real-world deficits and the selection of
tasks in future emotion recognition studies. Ulusoy et al. [11] examined the capacity of patients
with BD, their parents, and healthy controls to recognize and distinguish between facial emotions.


2.2. Speech emotion recognition 

Zhang et al. presented a novel attention-based fully convolutional network for speech emotion
recognition. Albanie et al. considered learning embeddings for speech classification without
access to labeled audio. Deep learning techniques are utilized as an alternative to traditional
approaches in speech emotion recognition; Khalil et al. discussed recent literature in which these
methods are used for speech-based emotion recognition. At the emotion classification stage, an
algorithm was proposed to determine the structure of the decision tree [15]. By utilizing the
characteristics of CNNs in modeling contextual information, Latif et al. showed that there is
still potential to enhance the performance of emotion recognition from raw speech. Both verbal and
nonverbal sounds within an utterance were considered for emotion recognition in real-life
conversations [17]. To create more accurate multimodal feature representations, Xu et al.
suggested using an attention mechanism to learn the alignment between speech frames and text
words. Koduru et al. improved a system’s speech emotion recognition rate using different feature
extraction algorithms. Siriwardhana et al. explored the use of modality-specific “BERT-like”
pre-trained self-supervised learning (SSL) architectures to represent both speech and text
modalities for the task of multimodal speech emotion recognition. Pepino et al. proposed a
transfer learning method for speech emotion recognition in which features extracted from
pre-trained wav2vec 2.0 models are modeled using simple neural networks.


2.3. Hybrid approaches for emotion recognition 

Albraikan et al. presented a hybrid sensor fusion approach based on a stacking model that allows
information from various sensors and emotion models to be jointly incorporated within a
user-independent model. Pane et al. proposed strategies incorporating emotion lateralization and
an ensemble learning approach to enhance the accuracy of EEG-based emotion recognition. Alswaidan
et al. critically surveyed the state-of-the-art research on explicit and implicit emotion
recognition in text. They discussed the different approaches in the literature, detailed their
main features, advantages, and limitations, and compared them in tables. Automatically
recognizing facial emotions during human-computer interaction can be difficult. Sandhu et al.
used a hybrid CNN approach to recognize human emotions and categorize them into subcategories
based on their features. A hybrid system consisting of three feature extraction stages,
dimensionality reduction, and feature classification was proposed for speech emotion recognition
(SER) [26], [27].

Moreover, a novel emotion recognition system based on a number of modalities, including
electroencephalogram (EEG), galvanic skin response (GSR), and facial expressions, was introduced
[28]. Siddiqui et al. presented a multimodal automatic emotion recognition (AER) framework
capable of accurately differentiating expressed emotions. To predict emotions by examining facial
expressions in an image, a convolutional neural network (CNN)-based deep learning method has been
proposed [30].


2.4. Physiological signal-based emotion recognition 

Shu et al. gave an in-depth analysis of physiological signal-based emotion recognition that
covered emotion models, emotion elicitation techniques, published emotional and physiological
datasets, features, classifiers, and the entire framework for emotion recognition based on
physiological signals. Li et al. presented an extensive and organized taxonomy for recognizing
emotions based on physiological signals. Emotion recognition methods based on multi-channel EEG
and multimodal physiological signals are reviewed in [33]. Kim et al. presented a robust
physiological model called the deep physiological affect network (DPAN) for recognizing human
emotions. Li et al. [35] proposed a multimodal attention-based BLSTM network framework for
efficient emotion recognition. The work in [37] attempts to fuse the subject’s individual EDA
features and the external evoked music features. Yin et al. proposed an end-to-end multimodal
framework, the one-dimensional residual temporal and channel attention network (RTCAN-1D). Chen
et al. proposed a single SP-signal-based method for emotion recognition. Stappen et al. offered
four different sub-challenges: i) MuSe-Wilder and ii) MuSe-Stress, which concentrate on
continuous emotion (valence and arousal) prediction; iii) MuSe-Sent, which requires participants
to identify five classes for valence and arousal; and iv) MuSe-Physio, which asks participants to
predict a novel aspect of “physiological-emotion”. For that year’s challenge, the Ulm-TSST
dataset, which displays people in stressful depositions, was introduced [40]. To fill a gap in the
present literature, Ahmad et al. reviewed the impact of inter-subject data variance on emotion
recognition, essential data annotation techniques for emotion recognition and their comparison,
data pre-processing methods for each physiological signal, data splitting techniques to enhance
the generalization of emotion recognition models, and multiple multimodal fusion methods and their
comparison.


3. RESEARCH METHOD 

The complete methodology of the emotion detection framework is shown in Figure 1. First, we
capture the image and crop it to the processing size. Next, we perform RGB to grayscale conversion
and apply histogram equalization. Canny edge detection and the Hough circle transform are then
performed to locate a person’s eyes. We then identify the critical points in the image as image
descriptors for classification. The working of the algorithm, shown in Figure 2, is elaborated in
this section.


3.1. Image capturing 

We use a high-end webcam, the Logitech Brio 4K, designed for professional use. It can capture video 
in 4K Ultra HD at 30 frames per second or 1080p or 720p at up to 60 frames per second. It has advanced 
features such as autofocus and 5x digital zoom and supports high dynamic range (HDR) imaging for improved 
color and contrast in difficult lighting conditions. 
Figure 1. Research methodology of our emotion detection framework

Figure 2. Snapshots showing the step-by-step implementation of our human emotion detection framework


3.2. Conversion to grayscale image 

We convert the captured image in the red, green, and blue (RGB) color space to a grayscale image in 
a single channel. Grayscale images are often used in computer vision tasks, such as object recognition, image 
segmentation, and edge detection, as they reduce the complexity of the image by eliminating color information 
while preserving the overall structure and contrast of the image. In a grayscale image, each pixel is represented 
by a single channel, with values ranging from 0 to 255, where 0 corresponds to black and 255 to white. 
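
As an illustration of the conversion described above, the following minimal sketch maps each RGB pixel to a single gray level. The weights 0.299, 0.587, and 0.114 are the common ITU-R BT.601 luma coefficients; the text does not state which weighting the framework uses, so they are an assumption, as is the function name `rgb_to_gray`.

```python
def rgb_to_gray(image):
    """Convert rows of (R, G, B) tuples to rows of gray levels in 0..255."""
    gray = []
    for row in image:
        gray.append([
            # BT.601 luma weights: an assumption, not taken from the paper.
            min(255, int(round(0.299 * r + 0.587 * g + 0.114 * b)))
            for (r, g, b) in row
        ])
    return gray


# Tiny 2x2 test image: black, white, pure red, pure blue.
sample = [
    [(0, 0, 0), (255, 255, 255)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = rgb_to_gray(sample)
```

Black and white map to the two ends of the range, while pure red and pure blue map to different gray levels because the channels are weighted unequally.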


3.3. Histogram equalization 

Next, we apply histogram equalization to the grayscale image. Histogram equalization is a
contrast enhancement method in digital image processing that redistributes the pixel intensities
in an image to improve the overall contrast. In other words, it adjusts the dynamic range of an
image by spreading the intensity levels over the whole range. It is done by calculating a
histogram of pixel intensities in the image and then modifying the pixel values so that the
histogram becomes more evenly distributed. It benefits images with a very narrow or compressed
range of pixel intensities, which appear flat or low in contrast. The result of histogram
equalization is an image with higher contrast and better visibility of details.
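
The redistribution step can be sketched with the cumulative histogram, as below. This is a generic textbook formulation under the assumption of 8-bit gray levels, not necessarily the exact variant used in the framework; `equalize_histogram` is a hypothetical name.

```python
def equalize_histogram(gray, levels=256):
    """Spread gray levels over the full range using the cumulative histogram."""
    pixels = [p for row in gray for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution function (CDF) of the intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to spread
        return [row[:] for row in gray]

    # Lookup table that makes the output CDF approximately linear.
    lut = [max(0, (c - cdf_min) * (levels - 1) // (total - cdf_min)) for c in cdf]
    return [[lut[p] for p in row] for row in gray]


# A low-contrast patch whose values occupy only the levels 100..103.
low_contrast = [
    [100, 100, 101, 101],
    [101, 102, 102, 103],
]
equalized = equalize_histogram(low_contrast)
```

In this example the four adjacent input levels are stretched to cover the whole range from 0 to 255.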


3.4. Edge detection 

Edge detection is a standard image processing technique used to identify and highlight the edges
or boundaries within an image. The edges in an image represent areas of rapid changes in
brightness or intensity, such as the boundaries between objects or the contours of shapes. The
edge detection algorithm works by analyzing the intensity differences between adjacent pixels in
an image and identifying areas where there is a sharp change in intensity.

There are several algorithms for edge detection, but some of the most common ones include Sobel [42], 
Canny [43], and Roberts operators [44]. The Sobel operator calculates the image intensity gradient in horizontal 
and vertical directions. In contrast, the Canny operator uses a multi-stage algorithm that includes smoothing, 
edge detection, and hysteresis thresholding to produce high-quality edge detection results. The Roberts operator 
is a simple but effective operator that calculates the gradient using a pair of 2x2 kernels. We apply the Canny 
edge detection technique to extract features from our images. 
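
To make the gradient idea concrete, the sketch below computes the Sobel gradient magnitude and applies a single threshold. It illustrates only the gradient stage: a full Canny detector additionally smooths the image, thins edges by non-maximum suppression, and applies hysteresis thresholding, none of which are reproduced here. The threshold value and function name are illustrative assumptions.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_edges(gray, threshold):
    """Return a binary edge map (1 = edge) from the Sobel gradient magnitude."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):          # border pixels are left as non-edges
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    value = gray[y + dy][x + dx]
                    gx += SOBEL_X[dy + 1][dx + 1] * value
                    gy += SOBEL_Y[dy + 1][dx + 1] * value
            if math.hypot(gx, gy) >= threshold:
                edges[y][x] = 1
    return edges


# Dark left half, bright right half: the edge is the vertical boundary.
step = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edge_map = sobel_edges(step, threshold=200)
```

Only the two columns adjacent to the intensity step are marked as edges; the uniform regions on either side produce a zero gradient.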


3.5. Hough circle transform 

The Hough circle transform [45] is a feature extraction technique used in digital image
processing to detect circular shapes in images. It is an extension of the Hough transform, which
was originally designed to detect straight lines in images. The Hough circle transform maps the
edge pixels of the image into a parameter space of circle centers and radii, in which each circle
is represented as a single point; when the radius is known, this parameter space is
two-dimensional. Each point in the parameter space corresponds to a circle in the original image,
with the radius and center coordinates of the circle encoded in the coordinates of the point. We
use the Hough circle transform to detect the irises of the person in the binary images produced
by Canny edge detection.
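
The voting scheme behind the Hough circle transform can be sketched as follows for a single, known radius, so that the accumulator is two-dimensional. This is a simplified illustration under that assumption; the radius, image size, and function names are hypothetical, and a practical detector would also search over a range of radii.

```python
import math


def hough_circle_centers(edge_points, radius, width, height, angle_steps=360):
    """Vote for circle centers of a known radius; return the accumulator."""
    accumulator = [[0] * width for _ in range(height)]
    for (x, y) in edge_points:
        voted = set()
        for step in range(angle_steps):
            theta = 2 * math.pi * step / angle_steps
            a = int(round(x - radius * math.cos(theta)))
            b = int(round(y - radius * math.sin(theta)))
            if 0 <= a < width and 0 <= b < height and (a, b) not in voted:
                voted.add((a, b))  # one vote per edge point per candidate center
                accumulator[b][a] += 1
    return accumulator


def best_center(accumulator):
    """Return (x, y) of the accumulator cell with the most votes."""
    best = max(
        ((votes, x, y) for y, row in enumerate(accumulator) for x, votes in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]


# Synthetic edge points on a circle of radius 6 centered at (10, 12).
true_center, radius = (10, 12), 6
points = {
    (round(true_center[0] + radius * math.cos(math.radians(d))),
     round(true_center[1] + radius * math.sin(math.radians(d))))
    for d in range(0, 360, 10)
}
acc = hough_circle_centers(points, radius, width=24, height=24)
center = best_center(acc)
```

Every edge point votes for all centers that lie one radius away from it, and the true center is the only cell that collects a vote from every point.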


3.6. Image registration 

Image alignment, also known as image registration, is the process of aligning multiple images of the 
same scene or object. The goal is to find a transformation that maps one image onto another so that they 
are in the same coordinate system. Image alignment is important in various applications, including computer 
vision, remote sensing, medical imaging, and astronomy. Next, we align the eyes in the images via the image 
registration method. 
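
One simple way to align faces by their eyes is a similarity transform (rotation, uniform scale, and translation) that moves the two detected eye centers onto fixed target positions. The sketch below assumes this model; the paper does not specify its registration transform, and the coordinates and function name are illustrative only.

```python
import math


def eye_alignment_transform(left_eye, right_eye, target_left, target_right):
    """Return a function mapping image points so the eyes land on the targets."""
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    tx, ty = target_right[0] - target_left[0], target_right[1] - target_left[1]

    scale = math.hypot(tx, ty) / math.hypot(dx, dy)
    angle = math.atan2(ty, tx) - math.atan2(dy, dx)
    cos_a, sin_a = scale * math.cos(angle), scale * math.sin(angle)

    def transform(point):
        # Rotate and scale about the left eye, then move it onto its target.
        px, py = point[0] - left_eye[0], point[1] - left_eye[1]
        return (target_left[0] + cos_a * px - sin_a * py,
                target_left[1] + sin_a * px + cos_a * py)

    return transform


# Tilted eye line (40, 60)-(80, 90) mapped onto a horizontal line at y = 50.
align = eye_alignment_transform((40, 60), (80, 90), (30, 50), (70, 50))
aligned_left = align((40, 60))
aligned_right = align((80, 90))
```

After the transform, both eyes lie on the same horizontal line at a fixed distance, so distances measured in different images become comparable.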


3.7. Keypoints detection and image descriptors 

Keypoint detection, also known as interest point detection, is a technique in computer vision that 
identifies and localizes distinctive features or points in an image. These key points are regions in an image 
with certain properties, such as high contrast, sharp edges, or corners, making them easily distinguishable 
from the surrounding areas. The process of keypoint detection typically involves analyzing an image using a 
series of algorithms to identify areas likely to be key points. Some popular algorithms for keypoint detection 
include Harris corner detection [47], scale-invariant feature transform (SIFT) [48], speeded-up robust features 
(SURF) [49], and ORB (Oriented FAST and Rotated BRIEF) [50]. 

We use a manual keypoint detection technique for the classification of human emotions in our
methodology. We locate the bottom of the forehead by drawing a line between the eyes in the image
and taking its center. Next, we calculate the distances from the bottom of the forehead to key
point locations such as the lips, cheeks, ears, chin, and forehead. We use 28 key points, and the
resulting distances serve as the image descriptors for the classification methods.
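
The descriptor computation described above can be sketched as Euclidean distances from the midpoint between the eyes to each key point. The coordinates below are hypothetical, and only three key points are shown instead of 28 to keep the example short.

```python
import math


def distance_descriptor(left_eye, right_eye, keypoints):
    """Distances from the midpoint between the eyes to each key point."""
    origin = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    return [math.hypot(x - origin[0], y - origin[1]) for (x, y) in keypoints]


# Three hypothetical key points (the framework uses 28).
descriptor = distance_descriptor((40, 50), (80, 50), [(60, 90), (30, 50), (63, 54)])
```

With the eyes at (40, 50) and (80, 50), the reference point is (60, 50), and the three distances are 40, 30, and 5 pixels.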


3.8. Classification 

Image classification is a task in computer vision that assigns a label or category to an input image. 
Image classification aims to teach a computer to recognize visual patterns in images and to classify them into 
one of several pre-defined categories or classes. This task is usually accomplished by training a machine 
learning model on a large dataset of labeled images, where the labels represent the correct category or class of 
the image. 

There are several techniques used for image classification, including traditional machine
learning algorithms such as support vector machines (SVM) and k-nearest neighbors (KNN) [52],
[53], Mixture of Experts (MoE) [54], and AdaBoost [55], as well as deep learning methods such as
convolutional neural networks (CNN) [56]. Deep learning has become the dominant approach for image
classification in recent years due to its ability to learn complex features directly from raw
image data. We constructed the dataset manually; it contains 1800 annotated images of persons
showing six different emotions. We took fifty candidates and captured six images of each person
for each of the six emotions (50 × 6 × 6 = 1800). Next, we divided the dataset into 75% (1350
images) for training, while the remaining 25% (450 images) were used to evaluate how well the
ANN generalizes. The dataset is divided equally among the emotions to remove any inherent dataset
bias.
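
The split sizes can be checked with a small stratified split, sketched below. The sample identifiers are placeholders, and the per-emotion shuffling with a fixed seed is an assumption; the paper does not describe how images were assigned to the two subsets beyond the equal division per emotion.

```python
import random

EMOTIONS = ["fear", "anger", "sadness", "happiness", "excitement", "normal"]


def stratified_split(samples, train_fraction=0.75, seed=0):
    """Split (item, label) pairs so every label keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in samples:
        by_label.setdefault(label, []).append((item, label))

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Placeholder identifiers: 50 people x 6 images x 6 emotions = 1800 samples.
samples = [((person, shot), emotion)
           for person in range(50) for shot in range(6) for emotion in EMOTIONS]
train, test = stratified_split(samples)
```

Each emotion contributes 300 images, of which 225 go to training and 75 to testing, giving the 1350 and 450 images stated above.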

We use an artificial neural network (ANN) with 28 input neurons, 56 neurons in the hidden second
layer, and six neurons in the output layer. The six output neurons classify the emotion in an
image as one of fear, anger, sadness, happiness, excitement, and normal. The learning algorithm
used is the Levenberg-Marquardt backpropagation algorithm [57]-[60]. The network converges to the
goal at around 252 epochs, as shown in Figure 3, and the output of the ANN, in this case
“normal”, is demonstrated in Figure 4.
Figure 3. Training curve for the artificial neural network model shows convergence to goal around 250 epochs

Figure 4. Output for the trained artificial neural network model shows detection of “normal” mode
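
The 28-56-6 layer structure can be illustrated with a plain forward pass, as sketched below. The weights are random placeholders rather than trained values, the tanh hidden activation and softmax output are assumptions not stated in the text, and the Levenberg-Marquardt training itself is not reproduced.

```python
import math
import random

EMOTIONS = ["fear", "anger", "sadness", "happiness", "excitement", "normal"]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (placeholders for trained values)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b) for each output neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def softmax(values):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
hidden_w, hidden_b = init_layer(28, 56, rng)
output_w, output_b = init_layer(56, 6, rng)


def classify(descriptor):
    """Forward pass: 28 distances -> 56 hidden units -> 6 emotion scores."""
    hidden = dense(descriptor, hidden_w, hidden_b, math.tanh)
    scores = softmax(dense(hidden, output_w, output_b, lambda v: v))
    return EMOTIONS[scores.index(max(scores))], scores


label, scores = classify([rng.random() for _ in range(28)])
```

With untrained weights the predicted label is arbitrary; the sketch only shows how a 28-element descriptor is mapped to one score per emotion.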


4. RESULT AND DISCUSSION 

The study’s results reveal the effectiveness of the six-emotion classification algorithm in
accurately classifying emotions using a machine learning framework. The algorithm’s overall
accuracy of 92.23% demonstrates its potential for practical applications in healthcare,
marketing, education, and human-computer interaction. The successful classification of emotions
using machine learning algorithms can enhance human-machine interactions, personalized user
experiences, and informed decision-making processes in various fields.

The confusion matrix generated in the study, shown in Figure 5, offers a deeper understanding of
the algorithm’s performance. It highlights that the algorithm had difficulties classifying the
“excited” and “afraid” emotions, with 14 and 8 false classifications, respectively. A closer
analysis of the confusion matrix indicates that most false classifications occurred between
“excited” and “happy” and between “excited” and “afraid”. The algorithm also exhibited confusion
between “normal” and “happy”. These misclassifications suggest that the algorithm might require
further refinement to improve its performance in distinguishing emotions with similar expressions
or characteristics.
Figure 5. Confusion matrix for emotions classification from 450 test images of random people dataset
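
For reference, overall accuracy and per-class false classifications can be derived from a confusion matrix as sketched below. The matrix is a hypothetical three-class example, not the data of Figure 5, and the convention of rows as true classes and columns as predicted classes is an assumption.

```python
def accuracy(matrix):
    """Overall accuracy: correct predictions (diagonal) over all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def misclassified_per_class(matrix):
    """For each true class (row), count samples assigned to another class."""
    return [sum(row) - row[i] for i, row in enumerate(matrix)]


# Hypothetical 3-class matrix (rows = true class, columns = predicted class).
example = [
    [18, 2, 0],
    [1, 17, 2],
    [0, 1, 19],
]
overall = accuracy(example)
errors = misclassified_per_class(example)
```

In this example 54 of 60 samples lie on the diagonal, and the off-diagonal entries of each row show which classes a given emotion is confused with.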


Another key finding of the study is the effectiveness of the manual distance calculation used as the 
image descriptor in the machine learning framework. It highlights the importance of selecting appropriate 
image descriptors for accurate emotion classification. Choosing the right image descriptor can significantly 
impact the algorithm’s performance and the overall accuracy of emotion classification. 

While the results demonstrate the potential of machine learning algorithms for accurately classifying 
emotions, it is essential to acknowledge the limitations in classifying specific emotions, particularly “excited” 
and “afraid”. Further research and development of the algorithm could focus on addressing these limitations 
and enhancing its performance in classifying these challenging emotions. It might involve exploring alternative 
or additional image descriptors, refining the feature extraction process, or incorporating other machine-learning 
techniques to improve classification accuracy. 


5. CONCLUSION 

The research article’s main conclusions highlight the potential and effectiveness of machine
learning algorithms in accurately classifying emotions, an area of growing research interest with
significant implications for fields like healthcare, marketing, education, and human-computer
interaction. The evaluated six-emotion classification algorithm achieved an overall accuracy of
92.23%, showcasing its potential for practical applications in these fields. However, the
confusion matrix generated during the study revealed some limitations in classifying specific
emotions, particularly “excited” and “afraid”. The majority of false classifications occurred
between “excited” and “happy” and between “excited” and “afraid”. This indicates the need for
further algorithm refinement to improve its performance in distinguishing emotions with similar
expressions or characteristics. The article also emphasizes the importance of selecting
appropriate image descriptors for accurate emotion classification. The manual distance
calculation used as the image descriptor in the machine learning framework proved effective,
suggesting that the choice of image descriptor plays a significant role in determining the
algorithm’s performance and overall accuracy. The article’s conclusions underscore the
effectiveness of a six-emotion classification algorithm using a machine learning framework in
emotion classification. However, addressing the limitations in classifying specific emotions will
enhance the algorithm’s performance and expand its practical applications in various fields that
rely on accurate emotion recognition.